Open Geospatial Consortium
Submission Date: 2023-03-07
Approval Date: <yyyy-mm-dd>
Publication Date: <yyyy-mm-dd>
External identifier of this OGC® document: http://www.opengis.net/doc/{doc-type}/{standard}/{m.n}
Internal reference number of this OGC® document: 23-008r1
Version: 1.0.0
Editors: Peng Yue, Boyi Shangguan
OGC Training Data Markup Language for Artificial Intelligence (TrainingDML-AI) Part 1: Conceptual Model Standard
Copyright notice |
Copyright © <year> Open Geospatial Consortium |
To obtain additional rights of use, visit http://www.ogc.org/legal/ |
Warning |
This document is not an OGC Standard. This document is distributed for review and comment. This document is subject to change without notice and may not be referred to as an OGC Standard.
Recipients of this document are invited to submit, with their comments, notification of any relevant patent rights of which they are aware and to provide supporting documentation.
Document type: OGC® Standard
Document subtype: Conceptual Model
Document stage: Draft
Document language: English
License Agreement
Permission is hereby granted by the Open Geospatial Consortium, ("Licensor"), free of charge and subject to the terms set forth below, to any person obtaining a copy of this Intellectual Property and any associated documentation, to deal in the Intellectual Property without restriction (except as set forth below), including without limitation the rights to implement, use, copy, modify, merge, publish, distribute, and/or sublicense copies of the Intellectual Property, and to permit persons to whom the Intellectual Property is furnished to do so, provided that all copyright notices on the intellectual property are retained intact and that each person to whom the Intellectual Property is furnished agrees to the terms of this Agreement.
If you modify the Intellectual Property, all copies of the modified Intellectual Property must include, in addition to the above copyright notice, a notice that the Intellectual Property includes modifications that have not been approved or adopted by LICENSOR.
THIS LICENSE IS A COPYRIGHT LICENSE ONLY, AND DOES NOT CONVEY ANY RIGHTS UNDER ANY PATENTS THAT MAY BE IN FORCE ANYWHERE IN THE WORLD.
THE INTELLECTUAL PROPERTY IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NONINFRINGEMENT OF THIRD PARTY RIGHTS. THE COPYRIGHT HOLDER OR HOLDERS INCLUDED IN THIS NOTICE DO NOT WARRANT THAT THE FUNCTIONS CONTAINED IN THE INTELLECTUAL PROPERTY WILL MEET YOUR REQUIREMENTS OR THAT THE OPERATION OF THE INTELLECTUAL PROPERTY WILL BE UNINTERRUPTED OR ERROR FREE. ANY USE OF THE INTELLECTUAL PROPERTY SHALL BE MADE ENTIRELY AT THE USER’S OWN RISK. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR ANY CONTRIBUTOR OF INTELLECTUAL PROPERTY RIGHTS TO THE INTELLECTUAL PROPERTY BE LIABLE FOR ANY CLAIM, OR ANY DIRECT, SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM ANY ALLEGED INFRINGEMENT OR ANY LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR UNDER ANY OTHER LEGAL THEORY, ARISING OUT OF OR IN CONNECTION WITH THE IMPLEMENTATION, USE, COMMERCIALIZATION OR PERFORMANCE OF THIS INTELLECTUAL PROPERTY.
This license is effective until terminated. You may terminate it at any time by destroying the Intellectual Property together with all copies in any form. The license will also terminate if you fail to comply with any term or condition of this Agreement. Except as provided in the following sentence, no such termination of this license shall require the termination of any third party end-user sublicense to the Intellectual Property which is in force as of the date of notice of such termination. In addition, should the Intellectual Property, or the operation of the Intellectual Property, infringe, or in LICENSOR’s sole opinion be likely to infringe, any patent, copyright, trademark or other right of a third party, you agree that LICENSOR, in its sole discretion, may terminate this license without any compensation or liability to you, your licensees or any other party. You agree upon termination of any kind to destroy or cause to be destroyed the Intellectual Property together with all copies in any form, whether held by you or by any third party.
Except as contained in this notice, the name of LICENSOR or of any other holder of a copyright in all or part of the Intellectual Property shall not be used in advertising or otherwise to promote the sale, use or other dealings in this Intellectual Property without prior written authorization of LICENSOR or such copyright holder. LICENSOR is and shall at all times be the sole entity that may authorize you or any third party to use certification marks, trademarks or other special designations to indicate compliance with any LICENSOR standards or specifications. This Agreement is governed by the laws of the Commonwealth of Massachusetts. The application to this Agreement of the United Nations Convention on Contracts for the International Sale of Goods is hereby expressly excluded. In the event any provision of this Agreement shall be deemed unenforceable, void or invalid, such provision shall be modified so as to make it valid and enforceable, and as so modified the entire Agreement shall remain in full force and effect. No decision, action or inaction by LICENSOR shall be construed to be a waiver of any rights or remedies available to it.
- 1. Scope
- 2. Conformance
- 3. Normative References
- 4. Terms and Definitions
- 5. Conventions
- 6. Overview
- 7. TrainingDML-AI UML Model
- 8. TrainingDML-AI Data Dictionary
- Appendix A: Abstract Test Suite (Normative)
- Appendix B: Example (Informative)
- Appendix C: Revision History (Informative)
- Appendix D: Bibliography
i. Abstract
The Training Data Markup Language for Artificial Intelligence (TrainingDML-AI) Standard aims to develop the UML model and encodings for geospatial machine learning training data. Training data plays a fundamental role in Earth Observation (EO) Artificial Intelligence Machine Learning (AI/ML), especially Deep Learning (DL). It is used to train, validate, and test AI/ML models. This Standard defines a UML model and encodings consistent with the OGC Standards baseline to exchange and retrieve the training data in the Web environment.
The TrainingDML-AI Standard provides detailed metadata for formalizing the information model of training data. This includes but is not limited to the following aspects:
- How the training data is prepared, such as provenance or quality;
- How to specify different metadata used for different ML tasks such as scene/object/pixel levels;
- How to differentiate the high-level training data information model and extended information models specific to various ML applications;
- How to introduce external classification schemes and flexible means for representing ground truth labeling.
ii. Keywords
The following are keywords to be used by search engines and document catalogues.
ogcdoc, OGC document, artificial intelligence, machine learning, deep learning, earth observation, remote sensing, training data, training sample, UML
iii. Preface
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. The Open Geospatial Consortium shall not be held responsible for identifying any or all such patent rights.
iv. Security Considerations
No security considerations have been made for this Standard.
v. Submitting organizations
The following organizations submitted this Document to the Open Geospatial Consortium (OGC):
- Wuhan University
- Pixalytics Ltd
- National Geospatial-Intelligence Agency
- George Mason University
- Laboratoire d’Informatique de Grenoble
- South Digital Technology Co., Ltd
- Wuhan University of Technology
- Hubei University
- Chongqing Changan Automobile Co., Ltd
vi. Submitters
All questions regarding this submission should be directed to the editor or the submitters:
Name | Affiliation
--- | ---
Peng Yue | Wuhan University
Jianya Gong | Wuhan University
Ruixiang Liu | Wuhan University
Dayu Yu | Wuhan University
Samantha Lavender | Pixalytics Ltd
Jim Antonisse | National Geospatial-Intelligence Agency
Liping Di | George Mason University
Eugene Yu | George Mason University
Danielle Ziébelin | Laboratoire d’Informatique de Grenoble
Boyi Shangguan | South Digital Technology Co., Ltd
Lei Hu | South Digital Technology Co., Ltd
Liangcun Jiang | Wuhan University of Technology
Mingda Zhang | Hubei University
Kai Yan | Chongqing Changan Automobile Co., Ltd
vii. Acknowledgments
Thanks to the members of the TrainingDML-AI Standards Working Group of the OGC as well as all contributors of change requests and comments. In particular: Scott Simmons, Carl Reed, Ivana Ivánová, Emily Daemen, Jon Duckworth, Zheheng Liang, Jibo Xie, Yuqi Bai, Winnie Shiu, Ignacio Correas, Chenxiao Zhang, Zhipeng Cao, Haofeng Tan, Yinyin Pan, Hanwen Xu, Shuaiqi Liu, Hao Li, Ming Wang, Kaixuan Wang, Haipeng Deng.
1. Scope
Training data is the building block of machine learning models, which now constitute the majority of machine learning applications in Earth science. Training data is used to train AI/ML models and then to validate model results. Formalizing and documenting training data by characterizing its content, metadata, data quality, provenance, and so forth is essential.
This OGC Training Data Standard (draft) addresses the following aspects of training data:
- Documents the UML model with a target of maximizing the interoperability and usability of geospatial training data;
- Defines different AI/ML tasks and labels in Earth Observation, including scene-level, object-level, and pixel-level tasks;
- Describes the permanent identifier, version, license, training data size, measurement or imagery used for annotation, and so on;
- Defines the description of quality (e.g., training data errors, training data unrepresentativeness) and of provenance (labeling, labeler, and labeling procedure).
2. Conformance
This TrainingDML-AI Standard defines a conceptual model that is independent of any encoding or formatting technologies. The standardization target for this Standard is:
- TrainingDML-AI Conceptual Model
Conformance with this Standard shall be checked using all the relevant tests specified in Annex A (normative) of this document. The framework, concepts, and methodology for testing, and the criteria to be achieved to claim conformance are specified in the OGC Compliance Testing Policies and Procedures and the OGC Compliance Testing web site.
All requirements classes and conformance classes described in this document are owned by the Standard identified.
3. Normative References
The following normative documents contain provisions that, through reference in this text, constitute provisions of this document. For dated references, subsequent amendments to, or revisions of, any of these publications do not apply. For undated references, the latest edition of the normative document referred to applies.
- ISO 19107:2019, Geographic information — Spatial schema
- ISO 19115-1:2014, Geographic information — Metadata — Part 1: Fundamentals
- ISO 19157-1, Geographic information — Data quality — Part 1: General requirements
4. Terms and Definitions
This document uses the terms defined in OGC Policy Directive 49, which is based on the ISO/IEC Directives, Part 2, Rules for the structure and drafting of International Standards. In particular, the word “shall” (not “must”) is the verb form used to indicate a requirement to be strictly followed to conform to this Standard, and OGC documents do not use the equivalent phrases in the ISO/IEC Directives, Part 2.
For the purposes of this document, the following additional terms and definitions apply.
4.1. Artificial Intelligence (AI)
refers to a set of methods and technologies that can empower machines or software to learn and perform tasks like humans.
4.2. Machine Learning (ML)
is an important branch of artificial intelligence. ML processes create models from training data by using a set of learning algorithms, and then can use these models to make predictions. Depending on whether the training data include labels, the learning algorithms can be divided into supervised and unsupervised learning.
4.3. Deep Learning (DL)
is a subfield of machine learning concerned with algorithms inspired by the structure and function of the brain called artificial neural networks.
4.4. Training Dataset
a collection of samples, often labelled in the case of supervised learning. A training dataset can be divided into training, validation, and test sets. Training samples differ from samples in OGC Observations & Measurements (O&M): they are often collected in purposive ways that deviate from purely probabilistic sampling, with known or expected results labelled as values of a dependent variable for generating a trained predictive model.
4.5. Label
refers to known or expected results annotated as values of a dependent variable in training samples. A training sample label is different from those on a geographical map, which are known as map labels or annotations.
4.6. Provenance
information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability, or trustworthiness. In this Standard, provenance is a record of how training data were prepared.
SOURCE: W3C (https://www.w3.org/TR/prov-overview/)
4.7. Quality
degree to which a set of inherent characteristics fulfils requirements [ISO 9000:2005, definition 3.1.1]. The quality of training data (such as data imbalance and mislabeling) can impact the performance of AI/ML models.
4.8. Earth Observation
data and information collected about our planet, whether atmospheric, oceanic or terrestrial. This includes space-based or remotely-sensed data, as well as ground-based or in situ data.
SOURCE: GEO (https://earthobservations.org/geo_wwd.php)
4.9. Scene Classification
task of identifying scene categories of images, on the basis of a training set of images whose scene categories are known.
4.10. Object Detection
task of recognizing objects such as cars from images. The objects are often localized using bounding boxes.
4.11. Semantic Segmentation
task of assigning class labels to pixels of images or points of point clouds.
5. Conventions
This section provides details and examples for any conventions used in the document. Examples of conventions are symbols, abbreviations, use of XML schema, or special notes regarding how to read the document.
5.1. Identifiers
The normative provisions in this specification are denoted by the URI:
All requirements and conformance tests that appear in this document are denoted by partial URIs which are relative to this base.
5.2. Abbreviated terms
In this document the following abbreviations and acronyms are used or introduced:
- AI — Artificial Intelligence
- DL — Deep Learning
- EO — Earth Observation
- ISO — International Organization for Standardization
- JSON — JavaScript Object Notation
- LC — Land Cover
- LU — Land Use
- ML — Machine Learning
- OGC — Open Geospatial Consortium
- RS — Remote Sensing
- TD — Training Data
- UML — Unified Modelling Language
- XML — Extensible Markup Language
5.3. UML Notation
The Standard is presented in this document through diagrams using the Unified Modeling Language (UML) static structure diagram. The UML notations used in this Standard are described in the diagram in Figure 1.
All associations between model elements in the TrainingDML-AI Conceptual Model are uni-directional and are thus navigable in only one direction. The direction of navigation is depicted by an arrowhead. In general, the context an element takes within the association is indicated by its role, which is displayed near the target of the association. If the graphical representation is ambiguous, the role must be placed next to the element the association points to.
The following stereotypes are used in this model.
- «DataType» defines a set of properties that lack identity. A data type is a classifier with no operations, whose primary purpose is to hold information.
- «CodeList» enumerates the valid attribute values. In contrast to an Enumeration, the list of values is open and thus not given inline in the TrainingDML-AI UML Model. The allowed values can be provided within an external code list.
6. Overview
The TrainingDML-AI Conceptual Model Standard defines how to represent and exchange ML training data. The conceptual model includes the most relevant training data entities, from datasets, to instances (i.e., individual training samples), to labels. The conceptual schema specifies how training data should be decomposed into parts and how those parts are classified.
The TrainingDML-AI conceptual model (Clause 7) is formally specified using UML class diagrams, complemented by a data dictionary (Clause 8) providing the definitions and explanations of the object classes and attributes. This conceptual model provides the basis for specifying encodings implemented in languages such as JSON or XML.
6.1. AI tasks for EO
In recent years, AI/ML has been increasingly used in the EO domain. New AI/ML algorithms frequently require large training datasets as benchmarks. AI/ML TD have been used in many EO applications to calibrate the performance of AI/ML models, and many efforts have been made to produce training datasets that enable accurate predictions. As a result, a number of training datasets are publicly available, with new datasets being constantly released. In the EO domain, AI/ML training datasets have been developed for various tasks, including the following typical scenarios:
- Scene classification. These algorithms determine image categories from numerous pictures (e.g., agricultural, forest, and beach scenes). The training samples are a series of labelled pictures. The data can be from satellites, drones, or aircraft. The metadata of the datasets often includes the number of training samples, the number of classes, and the image size.
- Object detection. These algorithms detect and localize different objects (e.g., airplanes, cars, and buildings) in a single image. The image can be optical or non-optical, such as Synthetic Aperture Radar (SAR). Recent work also suggests an increasing focus on object detection from street view imagery. Objects can be labelled with two forms of bounding boxes, i.e., oriented and horizontal bounding boxes. The geometry of a bounding box can be expressed using top-left/bottom-right coordinates, coordinates of the four corners, or center coordinates along with the length and width of the box.
- Semantic segmentation. For land cover (LC) and land use (LU) classification, this process assigns a LC/LU class label to a pixel (or group of pixels) of RS imagery. For 3D point clouds, semantic segmentation classifies the points of the cloud into categories. The TD are usually composed of RS images or point clouds, together with the corresponding labelled value of each pixel or point recording its class.
- Change detection. These algorithms identify differences between images acquired over the same geographical area but taken at different times. The TD comprise a set of pre-change and post-change RS images, with the corresponding ground truth map labelling changed and unchanged pixels. The images can be optical or SAR.
- 3D model reconstruction. These algorithms infer the 3D geometry and structure of objects and scenes, mainly through dense matching of multi-view images. The TD are usually composed of two-view or multi-view images, with the corresponding disparity maps or depth maps as ground truth, respectively.
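The three bounding-box geometry representations mentioned for object detection can be converted into one another. The sketch below is illustrative only; the function names are assumptions, not part of the Standard, and the oriented-box convention (rotation about the box center, angle in radians) is one of several used in practice.

```python
import math

def corners_from_minmax(xmin, ymin, xmax, ymax):
    """Expand a top-left/bottom-right (min/max) box into its four corners."""
    return [(xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax)]

def minmax_from_center(cx, cy, width, height):
    """Convert a center + width/height box to top-left/bottom-right form."""
    return (cx - width / 2, cy - height / 2,
            cx + width / 2, cy + height / 2)

def corners_from_oriented(cx, cy, width, height, angle_rad):
    """Four corners of an oriented box rotated by angle_rad about its center."""
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)
    offsets = [(-width / 2, -height / 2), (width / 2, -height / 2),
               (width / 2, height / 2), (-width / 2, height / 2)]
    return [(cx + dx * cos_a - dy * sin_a,
             cy + dx * sin_a + dy * cos_a) for dx, dy in offsets]
```

With a rotation angle of zero, an oriented box reduces to a horizontal one, so `corners_from_oriented` and `corners_from_minmax` agree on the same box.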
6.2. Modularization
The TrainingDML-AI conceptual model provides models for the most important elements within TD. These elements have been identified as either required or important in many different AI/ML tasks. However, implementations are not required to support the complete TrainingDML-AI model in order to be conformant to the Standard. Implementations may employ a subset of constructs according to their specific information needs. For this purpose, modularization is applied to the TrainingDML-AI model.
As shown in Figure 2, the TrainingDML-AI conceptual model is thematically decomposed into a Basic module, a Provenance module, a Quality module, and a Changeset module. The Basic module comprises the basic concepts and elements of TrainingDML-AI, including AI_TrainingDataset, AI_TrainingData, AI_Label, and AI_Task, and thus must be implemented by any conformant system. The Provenance module provides a comprehensive definition of provenance through AI_Labeling, AI_Labeler, and AI_LabelingProcedure. The Quality module offers a quality description of TD with AI_DataQuality elements. The Changeset module defines the AI_TDChangeset between versions of datasets.
6.3. General modeling principles
6.3.1. Element modeling
The modeling of all elements in the TrainingDML-AI conceptual model follows these principles (Reference [7]):
- Granularity. Two levels of granularity are differentiated in the conceptual model: the Training Dataset refers to the collection level, and the Training Data refers to the individual level.
- Label semantics. The training dataset is not limited to one classification scheme. External classification schemes should be allowed to be linked to the Training Dataset to accommodate different cases in practice.
- Light-weight design. The lightweight conceptual model has a minimum set of metadata elements, provenance, or quality measures at the collection level instead of at the individual level. This facilitates understanding of the dataset and improves scalability when communicating large training datasets.
- Alignment. The modelling of elements in TD can leverage existing efforts for wide adoption, such as ISO 19109 Geographic information — Rules for application schema, ISO 19115-1 Geographic information — Metadata — Part 1: Fundamentals, ISO 19157-1 Geographic information — Data quality — Part 1: General requirements, and the OGC Geography Markup Language (GML) Standard. The conceptual model can be aligned with these existing standards and leverage capabilities fulfilled in part by other standards.
- Quality, bias, and ethics. Elements related to quality, and more specifically to bias, can be used to reduce errors when using AI/ML. For example, any knowledge of TD imbalance and mislabeling can be stored in the TD quality. In addition, data ethics aims to safeguard the responsible use of TD, which can be addressed by using the license property in the TD.
- Changeset. This is an optional module in TD modelling. Changeset addresses how to capture changes in TD datasets. The change model considers the trend in TD collections toward using crowdsourcing platforms and borrows the change representation from platforms such as OpenStreetMap.
6.3.2. Class Hierarchy and Inheritance of Properties and Relations
In the TrainingDML-AI conceptual model, the specific elements such as EO training datasets, EO training data, scene label, object label, and pixel label are defined as subclasses of more general higher-level classes. Hence, elements build a hierarchy along specialization/generalization relationships where more specialized elements inherit the properties and relationships of all their superclasses along the entire generalization path to the topmost element.
6.3.3. Definition of the Semantics for all Classes, Properties, and Relations
The meanings of all elements defined in the TrainingDML-AI conceptual model are normatively specified in the data dictionary in Clause 8.
6.3.4. Data Integrity, Authenticity, and Non-repudiation
Training datasets can sometimes be downloaded, disseminated, and changed by anyone. Data integrity, authenticity, and non-repudiation are therefore important to prevent unexpected bias propagation and distorted results. Currently the Standard focuses on information modelling, while data dissemination can be enriched with strategies from the general information domain, such as publishing hashes (e.g., MD5) and public keys (e.g., RSA) after signing and encrypting.
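As a minimal sketch of the integrity-checking strategy mentioned above (not part of the Standard), a producer could publish a digest of a dataset archive so that consumers can verify the download. The function name and defaults are illustrative assumptions; signing with public keys would be layered on top of such a digest.

```python
import hashlib

def dataset_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Compute the digest of a training-dataset archive in streaming chunks,
    so large archives need not be loaded into memory at once."""
    h = hashlib.new(algorithm)  # e.g., "md5" or a stronger "sha256"
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

A consumer would recompute the digest of the downloaded file and compare it with the published value; a mismatch indicates corruption or tampering.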
6.4. Extending TrainingDML-AI
The TrainingDML-AI conceptual model is designed as a universal information model that defines elements and attributes which are useful for a broad range of AI/ML applications. In practical AI/ML applications, the elements within specific TDs will most likely contain attributes which are not explicitly modeled in TrainingDML-AI. Moreover, there might be TD elements which are not covered by the TrainingDML-AI thematic classes.
The model provides an abstract-class-based method to support the exchange of such data. Elements not represented by the predefined thematic classes of the model may be modeled and exchanged by extending the abstract classes.
7. TrainingDML-AI UML Model
The TrainingDML-AI UML model is the normative definition of the TrainingDML-AI Conceptual Model. The tables and figures in this section were software generated from the UML model. As such, this section provides a normative representation of the TrainingDML-AI Conceptual Model.
7.1. ISO dependencies
TrainingDML-AI builds on the ISO 19100 family of standards. The applicable standards are identified in Figure 3. Data dictionaries are included for all the ISO-defined classes explicitly referenced in the TrainingDML-AI UML model. These data dictionaries are provided for the convenience of the user. The ISO standards are the normative source.
The ISO classes explicitly used in the TrainingDML-AI UML model are introduced in Table 1. Further details about these classes can be found in the Data Dictionary within Clause 8.
Class Name | Description
--- | ---
Feature | Abstraction of real world phenomena.
MD_Band | Range of wavelengths in the electromagnetic spectrum.
MD_Scope | The target resource and physical extent for which information is reported.
EX_Extent | Extent of the resource.
CI_Citation | Standardized resource reference.
DataQuality | Quality information for the data specified by a data quality scope.
QualityElement | Aspect of quantitative quality information.
7.2. Overview of the UML model
The UML model is presented with core concepts in Figure 4, followed by the concrete classes in Figure 5. The following describes the core concepts:
- AI_TrainingDataset: This concept represents a collection of training samples, i.e., a training dataset.
- AI_TrainingData: This concept is an individual training sample in a training dataset.
- AI_Task: This concept is used to identify the task that the training dataset is used for.
- AI_Label: This concept represents the label semantics for TD.
- AI_Labeling: This concept provides the provenance of how TD are created.
- AI_TDChangeset: This concept records the types of TD changes between two versions of the training dataset.
- AI_DataQuality: This concept is associated with a training dataset to document its quality.
The full overview of concrete classes and attributes is presented in Figure 5. Concepts related to EO AI/ML applications are defined as classes extended from abstract classes. Each core concept, along with its related classes, is described in the following subsections.
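The dataset-to-sample-to-label relationship among the core concepts can be sketched in code. Only the class names (AI_TrainingDataset, AI_TrainingData, AI_Label) come from the model; the attributes and the helper method below are illustrative assumptions, not the normative definitions in Clause 8.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AI_Label:
    """Label semantics for a training sample (here, just a class name)."""
    label_class: str

@dataclass
class AI_TrainingData:
    """An individual training sample, pointing at its data and labels."""
    id: str
    data_url: str
    labels: List[AI_Label] = field(default_factory=list)

@dataclass
class AI_TrainingDataset:
    """A collection of training samples, described at the collection level."""
    id: str
    name: str
    data: List[AI_TrainingData] = field(default_factory=list)

    def class_names(self) -> List[str]:
        """Distinct label classes occurring across all samples."""
        return sorted({lab.label_class
                       for td in self.data for lab in td.labels})
```

For instance, a two-sample dataset with "forest" and "water" labels would report `["forest", "water"]` from `class_names()`, reflecting the collection-level view that the conceptual model emphasizes.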